Support Mixed precision & Static MSE PTQ in MCore export; Nemotron Super v3 NVFP4 recipe #1363
Conversation
📝 Walkthrough
This PR adds FP8 scale sweep stride control to calibration workflows, introduces three mixed-precision NVFP4 quantization recipes for Nemotron-3-Super-120B with different calibration methods (MSE, MSE with stride-4 sweep, max-based), refactors MoE calibration completeness checks to recursively traverse SequentialQuantizer leaves, and overhauls HuggingFace export to collect and apply per-layer quantization metadata.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (5 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 29-32: The metadata claims attention o_proj is FP8 per-tensor but
quant_cfg lacks any override for attention.o_proj; either update the description
or add the missing quantizer mapping. Fix by adding an explicit quant_cfg
override for the attention o_proj parameter name (e.g., attention.o_proj /
attention.output_projection / whatever key is used in your model mapping) to use
the FP8 per-tensor quantizer used elsewhere, or remove the attention o_proj
mention from the metadata so it matches the existing quant_cfg; ensure you
reference the exact layer key used in quant_cfg to keep mapping consistent with
the model.
- Around line 94-115: The entries using broad quantizer_name patterns
('*mixer.fc1_latent_proj*weight_quantizer',
'*mixer.fc1_latent_proj*input_quantizer',
'*mixer.fc2_latent_proj*weight_quantizer',
'*mixer.fc2_latent_proj*input_quantizer') are enabling FP8 for every latent
projection instead of only layers 1, 3, and 5; update these quantizer_name
values to target only the specific layer instances (e.g. include the layer
index/identifier for layers 1, 3, and 5 in the wildcard or use a regex/explicit
list) so only those mixer.fc1_latent_proj and mixer.fc2_latent_proj quantizers
are set to num_bits: e4m3, and leave all other latent projection quantizers at
BF16 (or remove the generic entries).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 60ed9c0b-efef-4967-b321-7270a8853455
📒 Files selected for processing (1)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
Codecov Report: ❌ patch coverage check failed. Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1363       +/-   ##
==========================================
- Coverage   76.93%   70.44%    -6.50%
==========================================
  Files         471      477        +6
  Lines       50404    55095     +4691
==========================================
+ Hits        38776    38809       +33
- Misses      11628    16286     +4658
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/model_calib.py (1)
389-412: ⚠️ Potential issue | 🟡 Minor
Custom FP8-sweep backends still ignore the new stride setting.
When a registered `backend_factory` is used, this branch still calls the old 3-argument factory signature, so `fp8_scale_sweep_stride` only takes effect on the built-in `NVFP4MSECalibrator` path below. That makes the new config a silent no-op for registry-backed sweep calibrators. Please extend the factory contract or reject a non-default stride explicitly in the backend path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 389 - 412, The registered backend factory path (lookup via _FP8_SWEEP_CALIBRATOR_REGISTRY and assigned to backend_factory) currently calls the factory with the old 3-argument signature and thus ignores fp8_scale_sweep_stride; update the branch that sets module._calibrator via backend_factory to either (A) call the factory with the new argument (pass fp8_scale_sweep_stride) and update the factory contract accordingly, or (B) explicitly detect a non-default fp8_scale_sweep_stride and raise/rollback with a clear error so users know registry-backed calibrators do not support stride; ensure the call still passes initial_amax, module._calibrator._axis, partial(_mse_quant_func, quantizer=module) and include fp8_scale_sweep_stride when choosing option A, mirroring how NVFP4MSECalibrator is constructed.
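The factory-contract extension this comment asks for could look like the following sketch. The names `backend_factory`, `initial_amax`, and `fp8_scale_sweep_stride` come from the review text; the helper itself is illustrative, not the actual `model_calib.py` code. It passes the stride only when the registered factory accepts it and fails loudly otherwise, instead of silently ignoring the setting:

```python
import inspect


def build_sweep_calibrator(backend_factory, initial_amax, axis, quant_func,
                           fp8_scale_sweep_stride=1):
    """Call a registered sweep-calibrator factory, forwarding the stride only
    when the factory's signature accepts it; reject a non-default stride
    otherwise so it never becomes a silent no-op. (Illustrative sketch.)"""
    params = inspect.signature(backend_factory).parameters
    if "fp8_scale_sweep_stride" in params:
        # Option A: extended factory contract.
        return backend_factory(initial_amax, axis, quant_func,
                               fp8_scale_sweep_stride=fp8_scale_sweep_stride)
    if fp8_scale_sweep_stride != 1:
        # Option B: legacy 3-argument factory cannot honor the stride.
        raise ValueError(
            "registered FP8-sweep calibrator does not support "
            f"fp8_scale_sweep_stride={fp8_scale_sweep_stride}"
        )
    return backend_factory(initial_amax, axis, quant_func)
```

Signature inspection keeps old registry entries working unchanged while new factories can opt in to the stride argument.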
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
41-42: ⚠️ Potential issue | 🟡 Minor
The metadata description still overstates what this recipe quantizes.
These lines say `attention o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but there are no matching overrides in `quant_cfg`, and the header comments say the latent MoE projections stay BF16. Please update the description so it matches the actual recipe.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 41-42: The metadata description in the YAML (the description field) incorrectly claims that "attention o_proj", "fc1_latent_proj", and "fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching overrides in quant_cfg and header comments state latent MoE projections remain BF16/FP16; update the description text to accurately reflect the recipe by removing or changing the FP8 claims (e.g., state that latent MoE projections and those specific projections remain BF16/FP16 and only list the layers that the quant_cfg actually overrides as FP8 per-tensor).

modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor
The metadata description does not match the actual quantizer mapping.
These lines still claim `attention o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but this recipe never enables those quantizers, and the header comments above say latent MoE stays BF16. Please align the description with the `quant_cfg` that follows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The YAML description string claims that "attention o_proj, fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in this recipe does not enable quantizers for "attention.o_proj", "fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays BF16/FP16); fix this by either updating the description to state those projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify quant_cfg to actually enable FP8 per-tensor quantizers for the keys "attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment matches the mapping.
🧹 Nitpick comments (1)
modelopt/torch/quantization/calib/mse.py (1)
202-206: Add a regression test for strided FP8 candidate generation.
The existing coverage only exercises the default 126-candidate path. This branch adds two new behaviors (subsampling and forced inclusion of the last candidate), so it should have a focused test for `fp8_scale_sweep_stride > 1` to lock down both the reduced candidate count and preservation of the max scale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/calib/mse.py` around lines 202 - 206, The strided FP8 candidate-generation branch guarded by fp8_scale_sweep_stride > 1 (in the block that subsamples fp8_values into candidates and appends the last value) is untested; add a focused regression test (e.g., test_fp8_scale_sweep_stride_preserves_last_candidate) that sets fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and asserts that the resulting candidates length is reduced according to the stride and that the final element equals the original fp8_values[-1] (verifying forced inclusion of the max scale); ensure the test covers both subsampling and the append behavior so the branch is locked down.
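As a rough illustration of the two behaviors the comment wants pinned down, here is a standalone sketch of the subsampling logic plus the kind of assertions such a regression test could make. This mirrors, but is not, the actual `calib/mse.py` implementation; the function name and the 126-candidate count come from the review text:

```python
def strided_fp8_candidates(fp8_values, stride):
    """Subsample the FP8 scale-sweep candidates by `stride`, always keeping
    the final (max-scale) candidate. Illustrative stand-in for the strided
    branch in calib/mse.py, not the real implementation."""
    if stride <= 1:
        return list(fp8_values)
    candidates = list(fp8_values[::stride])
    if candidates[-1] != fp8_values[-1]:
        # Force-include the max scale even when the stride skips it.
        candidates.append(fp8_values[-1])
    return candidates


# The regression test would assert both behaviors at once:
full = list(range(126))          # stand-in for the default 126-candidate sweep
strided = strided_fp8_candidates(full, 4)
assert len(strided) < len(full)  # subsampling reduced the candidate count
assert strided[-1] == full[-1]   # max-scale candidate is preserved
```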
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 16-28: Remove the unresolved Git merge conflict markers (<<<<<<<,
=======, >>>>>>>) and restore a single coherent comment block describing the
quant config; keep the more detailed version that lists both HF and
Megatron-Core names (the lines mentioning mixer.experts.<N>.{up,down}_proj,
mlp.experts.local_experts.<N>.linear_fc{1,2},
mixer.shared_experts.{up,down}_proj, and mlp.shared_experts.linear_fc{1,2}) or
merge its additional details into the shorter variant so the YAML comment is
valid and free of conflict markers.
---
Outside diff comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 389-412: The registered backend factory path (lookup via
_FP8_SWEEP_CALIBRATOR_REGISTRY and assigned to backend_factory) currently calls
the factory with the old 3-argument signature and thus ignores
fp8_scale_sweep_stride; update the branch that sets module._calibrator via
backend_factory to either (A) call the factory with the new argument (pass
fp8_scale_sweep_stride) and update the factory contract accordingly, or (B)
explicitly detect a non-default fp8_scale_sweep_stride and raise/rollback with a
clear error so users know registry-backed calibrators do not support stride;
ensure the call still passes initial_amax, module._calibrator._axis,
partial(_mse_quant_func, quantizer=module) and include fp8_scale_sweep_stride
when choosing option A, mirroring how NVFP4MSECalibrator is constructed.
---
Duplicate comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The YAML description string claims that "attention o_proj,
fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in
this recipe does not enable quantizers for "attention.o_proj",
"fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays
BF16/FP16); fix this by either updating the description to state those
projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify
quant_cfg to actually enable FP8 per-tensor quantizers for the keys
"attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment
matches the mapping.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 41-42: The metadata description in the YAML (the description
field) incorrectly claims that "attention o_proj", "fc1_latent_proj", and
"fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching
overrides in quant_cfg and header comments state latent MoE projections remain
BF16/FP16; update the description text to accurately reflect the recipe by
removing or changing the FP8 claims (e.g., state that latent MoE projections and
those specific projections remain BF16/FP16 and only list the layers that the
quant_cfg actually overrides as FP8 per-tensor).
---
Nitpick comments:
In `@modelopt/torch/quantization/calib/mse.py`:
- Around line 202-206: The strided FP8 candidate-generation branch guarded by
fp8_scale_sweep_stride > 1 (in the block that subsamples fp8_values into
candidates and appends the last value) is untested; add a focused regression
test (e.g., test_fp8_scale_sweep_stride_preserves_last_candidate) that sets
fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and
asserts that the resulting candidates length is reduced according to the stride
and that the final element equals the original fp8_values[-1] (verifying forced
inclusion of the max scale); ensure the test covers both subsampling and the
append behavior so the branch is locked down.
📒 Files selected for processing (5)
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
35-36: ⚠️ Potential issue | 🟡 Minor
Metadata precision mapping is inconsistent with the actual recipe.
Line 35 says latent MoE `fc1_latent_proj`/`fc2_latent_proj` are FP8 per-tensor, but this file has no latent-MoE quantizer overrides and the header (Line 27) says they stay BF16. Please align the description with `quant_cfg` to avoid confusion.
Proposed patch

```diff
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj - FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with FP8 scale sweep.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 35-36: The description's claim that latent MoE "fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the quant_cfg (and header) which leave those layers as BF16; either update the description string to state that fc1_latent_proj and fc2_latent_proj remain BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8 per-tensor (matching the rest of the FP8 settings); ensure the description text and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are aligned.

modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor
Description/comments conflict on latent MoE and SSM/mamba precision.
Lines 34-35 state latent MoE is FP8 per-tensor, but the recipe does not enable latent-MoE quantizers. Also, Lines 134-135 conflict with earlier comments on SSM/mamba precision. Please make these statements internally consistent.
Proposed patch

```diff
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj - FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with stride-4 FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with stride-4 FP8 scale sweep.
@@
-  # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head.
-  # SSM state / mamba conv1d stay FP16.
+  # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head, mamba conv1d.
+  # SSM state stays FP32 (can be set to FP16 in vLLM).
```

Also applies to: 134-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The description text claims "latent MoE is FP8 per-tensor" and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by making the YAML fields match the prose: either add the latent-MoE quantizer key (e.g., include latent_moe in the quantizers list or set enable_latent_moe_quantizers: true) and ensure its precision is set to FP8 per-tensor, or change the description to remove the FP8 latent-MoE claim; likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba precision fields to uniformly state BF16/FP16 (or change the description to reflect the actual configured precisions) so the "description" string and the quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision, mamba_precision, lm_head_precision / mtp_precision) are consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description text claims "latent MoE is FP8 per-tensor"
and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable
latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by
making the YAML fields match the prose: either add the latent-MoE quantizer key
(e.g., include latent_moe in the quantizers list or set
enable_latent_moe_quantizers: true) and ensure its precision is set to FP8
per-tensor, or change the description to remove the FP8 latent-MoE claim;
likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba
precision fields to uniformly state BF16/FP16 (or change the description to
reflect the actual configured precisions) so the "description" string and the
quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision,
mamba_precision, lm_head_precision / mtp_precision) are consistent.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description's claim that latent MoE
"fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the
quant_cfg (and header) which leave those layers as BF16; either update the
description string to state that fc1_latent_proj and fc2_latent_proj remain
BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg
for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8
per-tensor (matching the rest of the FP8 settings); ensure the description text
and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are
aligned.
📒 Files selected for processing (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
# Megatron-Core names: mlp.shared_experts.linear_fc{1,2}
# - Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
# - KV cache: FP8
# - Attention linears ({q,k,v}_proj): BF16 (not quantized)
Can we double check attention out linear? IIRC, attention o_proj should be FP8.
responded in slack, only 2/9 attention layers had o_proj FP8 in final Super NVFP4 ckpt, but we can always add it later to test if accuracy degradation is minimal
/claude review

@coderabbitai full review
✅ Actions performed: full review triggered.

@claude review
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)
1213-1229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Record packed-expert quant metadata against the module prefix, not the weight key.
In both pack-remap helpers, `prefix` is the tensor key written into `state_dict` (for example `...weight`). Recording that verbatim produces `layer_config_dict` entries like `...weight.quantization`, and `process_layer_quant_config()` will then emit `quantized_layers` names ending in `.weight` instead of the HF module prefix. Serving-side layer matching will miss those packed experts.
Suggested fix

```diff
- self._record_layer_quant_config(prefix, qformat, block_size)
+ module_prefix = prefix.rsplit(".", 1)[0] + "."
+ self._record_layer_quant_config(module_prefix, qformat, block_size)
```

Apply the same normalization in both `_pack_name_remapping()` and `_pack_name_remapping_gpt_oss()`.
Also applies to: 1280-1298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_megatron.py` around lines 1213 - 1229, The code records quant metadata under the full tensor key (e.g., "...weight") causing layer names to end with ".weight"; update both _pack_name_remapping and _pack_name_remapping_gpt_oss to normalize the prefix before calling self._record_layer_quant_config by stripping the tensor suffix (e.g., remove trailing ".weight" or the last dot component) so the module-level HF prefix is recorded instead of the weight key; apply the same normalization logic where self._record_layer_quant_config(prefix, qformat, block_size) is invoked in both functions.
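The normalization in the suggested fix is just a tensor-key-to-module-prefix strip. A tiny self-contained sketch, where the trailing-dot convention is an assumption carried over from the suggested diff rather than something verified against the exporter:

```python
def module_prefix(tensor_key: str) -> str:
    """Drop the trailing tensor name (e.g. 'weight') from a state_dict key,
    keeping a trailing dot, so quant metadata is keyed by the module prefix
    rather than the weight key. Illustrative sketch of the suggested fix."""
    return tensor_key.rsplit(".", 1)[0] + "."


# A packed-expert weight key maps back to its module prefix:
print(module_prefix("model.layers.0.mlp.experts.weight"))
# model.layers.0.mlp.experts.
```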
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml`:
- Around line 16-36: The recipe metadata/comments misstate Latent MOE behavior:
fc1_latent_proj/fc2_latent_proj are described as FP8 per-tensor but the
quant_cfg (and lines noting "stay BF16/FP16") never enable those quantizers;
either update the human-readable description to say Latent MOE projections
remain BF16 (or FP16 per existing comment) to match quant_cfg, or enable FP8
per-tensor quantizers for fc1_latent_proj and fc2_latent_proj in the quant_cfg
so the comment matches behavior—make the change by editing the
metadata/description block and/or the quant_cfg entries that reference
fc1_latent_proj and fc2_latent_proj accordingly.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description erroneously claims the Latent MoE is "FP8
per-tensor" while the recipe does not enable quantizer patterns for
fc1_latent_proj or fc2_latent_proj; update the metadata description string (the
YAML "description" field) to remove or correct the FP8 statement for Latent MoE
(or explicitly state that fc1_latent_proj/fc2_latent_proj remain BF16/FP16) so
it matches the actual quantizer configuration for fc1_latent_proj and
fc2_latent_proj.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 35-36: The description claims latent MoE is FP8 and that static
scales are "chosen by MSE", but the YAML sets no quantizers for
fc1_latent_proj/fc2_latent_proj and uses method: max; update the human-readable
description to match the actual config by either (A) enabling the latent
projection quantizers (fc1_latent_proj/fc2_latent_proj) to make latent MoE FP8,
or (B) explicitly state latent projections remain unquantized (not FP8) and
change the phrase "static scales are chosen by MSE" to reflect the configured
method (e.g., "static scales computed by max" or "method: max"); adjust the
description lines referencing "FP8 per-tensor; ... Latent MOE
fc1_latent_proj/fc2_latent_proj" and the sentence about static scale selection
accordingly so the text and the fields (fc1_latent_proj, fc2_latent_proj, and
method: max) are consistent.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description string misreports the latent-MoE policy:
update the description (the YAML description field) to reflect that
fc1_latent_proj and fc2_latent_proj are left unquantized (per the quant_cfg)
rather than being "FP8 per-tensor"; edit the text around the FP8/KV/latent-MoE
sentence to explicitly state that fc1_latent_proj/fc2_latent_proj remain
unquantized (or their actual precision) and keep FP8 per-tensor/KV wording only
for the tensors that truly use FP8.
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 815-818: The current check only treats qformat is None as
excluded, but QUANTIZATION_NONE should be treated the same so those layers are
recorded as excluded (and not dropped from the exported quant config); update
the conditional in unified_export_megatron.py where qformat is obtained from
_get_quantization_format(module) to also consider QUANTIZATION_NONE (e.g., if
qformat is None or qformat == QUANTIZATION_NONE) before calling
_record_excluded_module(prefix), and ensure any references to
_record_layer_quant_config still see these explicitly disabled quantizers as
excluded rather than omitted.
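The intent of this fix can be sketched as a small routing helper. The `QUANTIZATION_NONE` value below is a stand-in; the real sentinel (and the `_record_*` methods it feeds) live in modelopt's export code:

```python
QUANTIZATION_NONE = "none"  # stand-in for the exporter's real sentinel


def record_layer(prefix, qformat, quantized_layers, excluded_modules):
    """Treat both `None` and QUANTIZATION_NONE as 'not quantized': record the
    module as explicitly excluded instead of silently dropping it from the
    exported quant config. Illustrative sketch of the suggested conditional."""
    if qformat is None or qformat == QUANTIZATION_NONE:
        excluded_modules.append(prefix)
    else:
        quantized_layers[prefix] = qformat
```

Recording the exclusion (rather than omitting the layer) keeps serving-side consumers of the quant config aware that the layer was intentionally left unquantized.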
---
Outside diff comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 1213-1229: The code records quant metadata under the full tensor
key (e.g., "...weight") causing layer names to end with ".weight"; update both
_pack_name_remapping and _pack_name_remapping_gpt_oss to normalize the prefix
before calling self._record_layer_quant_config by stripping the tensor suffix
(e.g., remove trailing ".weight" or the last dot component) so the module-level
HF prefix is recorded instead of the weight key; apply the same normalization
logic where self._record_layer_quant_config(prefix, qformat, block_size) is
invoked in both functions.
📒 Files selected for processing (11)
modelopt/torch/export/unified_export_megatron.py
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/qtensor/nvfp4_tensor.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
tests/gpu_megatron/torch/export/test_unified_export_megatron.py
@CodeRabbit full review
✅ Actions performed: full review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (1)
120-122: 💤 Low value
`_get_static_global_amax` is called twice for the static branch.
`_is_static_quantizer` internally calls `_get_static_global_amax` and discards the result; line 122 then calls it again. Consider hoisting to a single variable:
♻️ Proposed refactor

```diff
- if cls._is_static_quantizer(weight_quantizer):
-     # Static path: use pre-computed per-block amax values from quantizer
-     global_amax = cls._get_static_global_amax(weight_quantizer).float()
+ global_amax = cls._get_static_global_amax(weight_quantizer)
+ if global_amax is not None:
+     # Static path: use pre-computed per-block amax values from quantizer
+     global_amax = global_amax.float()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 120-122: Hoist the call to _get_static_global_amax so it's executed only once: call amax = cls._get_static_global_amax(weight_quantizer) (and .float() as needed) and reuse that value instead of calling _get_static_global_amax again; update _is_static_quantizer to accept an optional precomputed amax parameter or refactor _is_static_quantizer to avoid calling _get_static_global_amax internally so the static branch uses the hoisted amax (refer to _is_static_quantizer, _get_static_global_amax and weight_quantizer).

tests/gpu_megatron/torch/export/test_unified_export_megatron.py (1)
151-164: ⚡ Quick win
Add one end-to-end mixed-precision export case.
The new export path here is the per-layer `layer_config_dict`/`process_layer_quant_config` flow, but this matrix still only exercises uniform configs (`NVFP4_DEFAULT_CFG`/`FP8_DEFAULT_CFG`). A single mixed-recipe case would catch regressions in `quantized_layers`, excludes, and the `config.json` ↔ `hf_quant_config.json` parity you just tightened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines 151 - 164, Add a new parametrized test case to the existing pytest.mark.parametrize matrix (the tuple of ("model_type", "arch", "extra_module", "quant_config", "kv_cache_quant_cfg")) to exercise the per-layer mixed-precision export path: supply a quant_config that triggers the layer_config_dict / process_layer_quant_config flow (i.e., a mixed-recipe config rather than NVFP4_DEFAULT_CFG or FP8_DEFAULT_CFG) so the test exercises quantized_layers, excludes handling, and the config.json ↔ hf_quant_config.json parity; update the matrix alongside existing entries (referencing the parameters model_type/arch/extra_module/quant_config/kv_cache_quant_cfg) so one case uses a mixed per-layer config for either nemotron or llama.
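One way to phrase such a matrix row, sketched with entirely hypothetical config contents and placeholder arch/module values (the real parametrize entries and recipe keys live in the test file and modelopt's quantization configs):

```python
# Hypothetical mixed per-layer recipe: NVFP4 block quantization for most
# linear weights, an FP8 override for shared experts, lm_head disabled.
# Keys and values below are illustrative, not real modelopt configs.
MIXED_NVFP4_FP8_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        "*shared_experts*weight_quantizer": {"num_bits": (4, 3)},
        "*lm_head*": {"enable": False},
        "default": {"enable": False},
    },
    "algorithm": "max",
}

# Extra row for the ("model_type", "arch", "extra_module", "quant_config",
# "kv_cache_quant_cfg") parametrize matrix; placeholders throughout.
mixed_case = ("llama", "LlamaForCausalLM", None, MIXED_NVFP4_FP8_CFG, None)
```

A case like this drives the per-layer export path once, so `quantized_layers`, excluded modules, and the two config files get exercised together rather than only under uniform recipes.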
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 45-79: The two routed-expert weight quantizer entries
('*mixer.experts.*weight_quantizer' and '*mlp.experts*weight_quantizer')
currently use type: dynamic; change them to use the static-weight calibration
path by replacing type: dynamic with type: static (or remove the dynamic setting
and ensure static is explicitly set) so routed-expert weights remain static in
this max-calib variant while leaving the corresponding input_quantizer entries
unchanged.
---
Nitpick comments:
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 120-122: Hoist the call to _get_static_global_amax so it's
executed only once: call amax = cls._get_static_global_amax(weight_quantizer)
(and .float() as needed) and reuse that value instead of calling
_get_static_global_amax again; update _is_static_quantizer to accept an optional
precomputed amax parameter or refactor _is_static_quantizer to avoid calling
_get_static_global_amax internally so the static branch uses the hoisted amax
(refer to _is_static_quantizer, _get_static_global_amax and weight_quantizer).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 94f3528f-2ee4-4757-b8de-3a323300ca74
📒 Files selected for processing (10)
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/calib/mse.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/megatron.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
- tests/gpu_megatron/torch/export/test_unified_export_megatron.py
```yaml
# MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale.
# HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj.
- quantizer_name: '*mixer.experts.*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mixer.experts.*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
# Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
- quantizer_name: '*mlp.experts*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mlp.experts*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
```
Keep routed-expert weights static in the max-calib variant.
These two `*...weight_quantizer` entries are `type: dynamic`, so this recipe does more than swap `method: mse` for `method: max`: it changes routed-expert weights to dynamic NVFP4 and bypasses the static-weight calibration path entirely. That makes this variant a different quantization recipe, not a max-calibrated version of `super-nvfp4.yaml`.
Suggested fix

```diff
 - quantizer_name: '*mixer.experts.*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
       -1: 16
-    type: dynamic
+    type: static
     scale_bits: e4m3
     num_bits: e2m1
@@
 - quantizer_name: '*mlp.experts*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
       -1: 16
-    type: dynamic
+    type: static
     scale_bits: e4m3
     num_bits: e2m1
```
```python
if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
```
do we change the format of hf_quant_config.json by this change?
Yes, to align it with the quant config inside `config.json`. Previously `hf_quant_config.json` was still using the older JSON format, whereas the quant config inside `config.json` was being converted to the newer format.
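For illustration, the alignment amounts to flattening one raw layout into the serving-facing one; the function name and key mapping below are assumptions for the sketch, not the actual ModelOpt helper:

```python
# Hypothetical sketch: flatten the raw ModelOpt layout
# ({"producer": ..., "quantization": ...}) into the serving-facing layout
# already used inside config.json. Key names are illustrative.
def convert_quant_config_sketch(raw_config: dict) -> dict:
    quant = raw_config.get("quantization", {})
    converted = {
        "quant_algo": quant.get("quant_algo"),
        "kv_cache_quant_algo": quant.get("kv_cache_quant_algo"),
        "ignore": quant.get("exclude_modules", []),
    }
    # Keep the output minimal: drop keys that are unset in the raw config.
    return {k: v for k, v in converted.items() if v not in (None, [])}


raw = {
    "producer": {"name": "modelopt", "version": "0.0"},
    "quantization": {"quant_algo": "NVFP4", "exclude_modules": ["lm_head"]},
}
converted = convert_quant_config_sketch(raw)
```

Writing this converted dict to both files is what gives the two configs a single schema.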
```python
return global_amax is not None


@classmethod
def _get_static_global_amax(cls, weight_quantizer):
```
Thanks for catching the bug! I'd like to understand why the MCore path uses `_global_amax` instead of `global_amax`. Is there any general guidance for when to use the private attribute name and when to use the public one? I think we have a similar issue with `amax` vs `_amax`.
During checkpoint resume, MCore initializes the quantizers as generic `TensorQuantizer`, which carries a `_global_amax` parameter (unlike `NVFP4QTensor`, which has `global_amax`).
@Fridah-nv I have changed the MCore static quantizer restore to restore as `NVFP4QTensor` in `mtq.plugins.custom`.
Fridah-nv
left a comment
Changes in MSE flow LGTM
/claude review |
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
force-pushed the branch from 1a10eeb to 8fc569d
```python
axis=module._calibrator._axis,
global_amax=module.global_amax,
quant_func=partial(_mse_quant_func, quantizer=module),
fp8_scale_sweep_stride=fp8_scale_sweep_stride,
```
[CRITICAL Algorithm] NVFP4MSECalibrator.__init__ does not accept a fp8_scale_sweep_stride parameter (see calib/mse.py:177-184). Passing it here will raise TypeError: __init__() got an unexpected keyword argument 'fp8_scale_sweep_stride' at runtime whenever fp8_scale_sweep=True and the quantizer is an NVFP4 static quantizer.
Either:
- Add `fp8_scale_sweep_stride` to `NVFP4MSECalibrator.__init__` and use it in `_generate_candidates` to subsample the candidate list (e.g., `fp8_values[::stride]`), or
- Remove this kwarg from the call site until the calibrator supports it.
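A minimal sketch of the first option, assuming a candidate generator shaped like the one the review references; the class and method names (`_generate_candidates`, `fp8_scale_sweep_stride`) mirror the review text and are not verified against the actual calibrator:

```python
# Sketch: accept a stride in the constructor and subsample the FP8 scale
# sweep, so stride=4 evaluates a quarter of the candidates.
class Nvfp4MseCalibratorSketch:
    def __init__(self, num_candidates: int = 16, fp8_scale_sweep_stride: int = 1):
        self.num_candidates = num_candidates
        self.stride = fp8_scale_sweep_stride

    def _generate_candidates(self) -> list[float]:
        # Hypothetical full sweep over power-of-two scales, then strided.
        fp8_values = [
            2.0**e for e in range(-self.num_candidates // 2, self.num_candidates // 2)
        ]
        return fp8_values[:: self.stride]


full = Nvfp4MseCalibratorSketch()._generate_candidates()
strided = Nvfp4MseCalibratorSketch(fp8_scale_sweep_stride=4)._generate_candidates()
```

With the stride accepted in `__init__`, the call site shown in the diff above would no longer raise.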
```python
if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
```
[IMPORTANT Compatibility] This changes the schema of hf_quant_config.json from the raw ModelOpt format ({"producer": {...}, "quantization": {...}}) to the compressed-tensors-compatible format ({"config_groups": {...}, "ignore": [...], "quant_algo": ...}).
The HF export path (unified_export_hf.py:1190-1194) still writes the raw format to hf_quant_config.json and the converted format to config.json. This PR makes the MCore export path diverge by writing converted format in both places.
Any downstream tools (TRT-LLM build scripts, vLLM loaders) that read hf_quant_config.json from MCore-exported checkpoints and expect the "quantization" top-level key will break. If this is intentional (which the test updates suggest), please mention the schema change in the PR description and/or changelog so consumers know to update.
```python
if _has_state(self.weight_quantizer, "_amax") and not _has_complete_static_nvfp4_weight_state(
    self.weight_quantizer, self.weight
):
```
[IMPORTANT Compatibility] The condition _has_state(wq, "_amax") and not _has_complete_static_nvfp4_weight_state(wq, self.weight) means: if _amax exists but the quantizer is NOT a complete static NVFP4, recalibrate. Previously it always recalibrated when _amax existed.
However, _has_complete_static_nvfp4_weight_state checks global_amax = getattr(quantizer, "global_amax", None) which relies on the global_amax property of NVFP4StaticQuantizer (which in turn reads _global_amax buffer). If the quantizer class hasn't been upgraded to NVFP4StaticQuantizer yet at this point during restore (because modelopt_post_restore is called before the megatron plugin's set_extra_state → from_tensor_quantizer), then global_amax will be None (plain TensorQuantizer doesn't have this property).
In that case, the function returns False, falling through to recalibrate — which is the safe/correct fallback. So this works correctly, but the "preserve MSE scales" path only activates when the quantizer has already been upgraded. Is that the intended sequence? If the HF restore path always upgrades first (via the _check_unsupported_states / from_tensor_quantizer flow), this is fine. Just flagging for awareness.
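The safe fallback described here can be sketched as follows; both classes and the helper are illustrative stand-ins, assuming only that a not-yet-upgraded quantizer lacks the `global_amax` attribute:

```python
# Sketch: a completeness check that returns False (forcing recalibration)
# whenever the quantizer has not been upgraded yet, so the worst case is
# redundant work rather than stale or missing scales.
def has_complete_static_nvfp4_state(quantizer) -> bool:
    return getattr(quantizer, "global_amax", None) is not None


class PlainTensorQuantizer:
    """Stand-in for a quantizer restored before the upgrade step."""


class StaticNvfp4Quantizer:
    """Stand-in for an upgraded quantizer carrying the calibrated amax."""

    global_amax = 448.0


needs_recalib_plain = not has_complete_static_nvfp4_state(PlainTensorQuantizer())
needs_recalib_static = not has_complete_static_nvfp4_state(StaticNvfp4Quantizer())
```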
```python
for leaf_quantizer in _iter_leaf_quantizers(quantizer):
    if _is_dynamic_block_quantizer(leaf_quantizer):
        continue
    leaf_quantizer.sync_amax_across_distributed_group(parallel_state.data_parallel_group)
    leaf_quantizer.sync_amax_across_distributed_group(
        parallel_state.expert_model_parallel_group
    )
```
[SUGGESTION] The old code guarded sync_amax_across_distributed_group with if getattr(quantizer, "_amax", None) is not None. The new code removes this guard and relies on the internal check inside sync_amax_across_distributed_group.
While functionally correct (the method has if ... getattr(self, "_amax", None) is not None internally), this introduces unnecessary all_reduce call-site overhead for dynamic block quantizers that were already skipped by _is_dynamic_block_quantizer. However, non-dynamic quantizers that happen to not have _amax (e.g., disabled quantizers yielded by _iter_leaf_quantizers) will now call into sync_amax_across_distributed_group only to no-op internally. Minor perf consideration — no functional bug.
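The internal-guard behavior being relied on can be sketched with a stand-in quantizer; the class is illustrative, assuming only that the real sync method checks `_amax` before reducing across the distributed group:

```python
# Sketch: the per-call guard lives inside the sync method, so quantizers
# without an _amax simply no-op when called.
class LeafQuantizerSketch:
    def __init__(self, amax=None):
        self._amax = amax
        self.synced = False

    def sync_amax_across_distributed_group(self, group=None):
        if getattr(self, "_amax", None) is None:
            return  # internal guard: nothing to reduce
        self.synced = True  # stand-in for the actual all_reduce


quantizers = [LeafQuantizerSketch(1.0), LeafQuantizerSketch(None)]
for q in quantizers:
    q.sync_amax_across_distributed_group()
```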
```python
self._record_layer_quant_config(q_proj_prefix, qformat, block_size)
self._record_layer_quant_config(k_proj_prefix, qformat, block_size)
self._record_layer_quant_config(v_proj_prefix, qformat, block_size)
if qformat in (None, QUANTIZATION_NONE):
    # MCore stores Q/K/V as one fused linear_qkv module, but HF exports them
    # as separate q_proj/k_proj/v_proj modules. Record the HF names so
    # runtime quant configs do not miss excluded fused-QKV projections.
    fused_prefix = prefix.removesuffix(".")
    self.exclude_modules = [m for m in self.exclude_modules if m != fused_prefix]
    self._record_excluded_module(q_proj_prefix)
    self._record_excluded_module(k_proj_prefix)
    self._record_excluded_module(v_proj_prefix)
```
[SUGGESTION] When qformat in (None, QUANTIZATION_NONE), _record_excluded_module is called with q_proj_prefix, k_proj_prefix, v_proj_prefix which all end with ".". The method correctly strips the trailing dot via removesuffix(".").
However, the list-comprehension self.exclude_modules = [m for m in self.exclude_modules if m != fused_prefix] removes the original fused module entry. If _qkv_slicing is called multiple times for the same prefix (e.g., from different pipeline stages in distributed), the second call would find fused_prefix already absent — which is fine. But the deduplication inside _record_excluded_module (the if layer_name not in self.exclude_modules check) prevents the individual proj names from being double-added too. Good.
One minor observation: if prefix is "" (empty), then fused_prefix = "".removesuffix(".") is also "", and the list-comp would try to remove "" from exclude_modules. You've guarded _record_excluded_module against empty strings, but the fused_prefix removal isn't similarly guarded. Consider adding if fused_prefix: before the list-comp to be consistent.
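The suggested guard can be sketched as a small helper; the function name and list handling are assumptions for illustration, not the real export code:

```python
# Sketch of the QKV exclude expansion with the suggested empty-prefix guard.
def expand_fused_qkv_exclude(exclude_modules: list[str], prefix: str) -> list[str]:
    exclude_modules = list(exclude_modules)  # do not mutate the caller's list
    fused_prefix = prefix.removesuffix(".")
    if fused_prefix:  # suggested guard: skip removal when the prefix is empty
        exclude_modules = [m for m in exclude_modules if m != fused_prefix]
    for proj in ("q_proj", "k_proj", "v_proj"):
        name = f"{prefix}{proj}"
        if name not in exclude_modules:  # dedupe, as in _record_excluded_module
            exclude_modules.append(name)
    return exclude_modules


out = expand_fused_qkv_exclude(["layers.0.attn"], "layers.0.attn.")
```

The fused entry is swapped for the three per-projection HF names, and an empty prefix no longer tries to remove `""` from the list.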
```python
start_multiplier: float = 0.25,
stop_multiplier: float = 4.0,
fp8_scale_sweep: bool = False,
fp8_scale_sweep_stride: int = 1,
```
[IMPORTANT Compatibility] fp8_scale_sweep_stride is added to mse_calibrate() but not to MseCalibConfig (see config.py:1287). Since wrapped_calib_func calls func(model, forward_loop=forward_loop, **config.model_dump()), this parameter can never flow through the standard ModelOpt mode/config pipeline (e.g., when using apply_mode with a recipe).
For recipes like super-nvfp4.yaml to actually use fp8_scale_sweep_stride, it needs to be added to MseCalibConfig (or a new config field) and parsed from the recipe YAML algorithm section. Without this, the parameter is only usable via direct mse_calibrate() calls, which limits the recipe-driven workflow.
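The plumbing gap can be sketched with stand-ins: parameters only reach the calibrate function if they exist as fields on the config model, because the wrapper forwards `config.model_dump()` as kwargs. The dataclass below mimics that pydantic-style behavior; `MseCalibConfigSketch` and `mse_calibrate_sketch` are illustrative, not the real ModelOpt classes:

```python
from dataclasses import asdict, dataclass


@dataclass
class MseCalibConfigSketch:
    num_steps: int = 16
    fp8_scale_sweep: bool = False
    fp8_scale_sweep_stride: int = 1  # field must exist for recipes to set it


def mse_calibrate_sketch(num_steps=16, fp8_scale_sweep=False, fp8_scale_sweep_stride=1):
    # Stand-in for the real calibrate function; just echoes what it received.
    return {"sweep": fp8_scale_sweep, "stride": fp8_scale_sweep_stride, "steps": num_steps}


cfg = MseCalibConfigSketch(fp8_scale_sweep=True, fp8_scale_sweep_stride=4)
result = mse_calibrate_sketch(**asdict(cfg))  # mirrors func(model, **config.model_dump())
```

Without the `fp8_scale_sweep_stride` field on the config, the kwarg simply never appears in the dump and the function falls back to its default.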
Code Review Summary
Findings: CRITICAL: 1, IMPORTANT: 3, SUGGESTION: 2
Critical
`fp8_scale_sweep_stride` passed to `NVFP4MSECalibrator`, which doesn't accept it (model_calib.py:412). This will raise a `TypeError` at runtime whenever `fp8_scale_sweep=True` and the quantizer matches the NVFP4 static path. The constructor at calib/mse.py:177-184 has no such parameter.
Important
- `hf_quant_config.json` schema change (unified_export_megatron.py:354). MCore exports now write the converted (compressed-tensors) format to `hf_quant_config.json`, breaking downstream tools that expect the raw `{"producer": {...}, "quantization": {...}}` format. The HF export path still writes the old format to this file.
- `fp8_scale_sweep_stride` not in `MseCalibConfig` (model_calib.py:335). The parameter can only be used via direct `mse_calibrate()` calls, not through the standard recipe/mode pipeline, limiting its utility for the new YAML recipes.
- Static NVFP4 preserve-amax path timing (plugins/custom.py:164). The "preserve MSE scales" logic depends on the quantizer already being upgraded to `NVFP4StaticQuantizer`. If the upgrade hasn't happened yet (depends on restore ordering), it safely falls through to recalibration, but this means MSE scales may not be preserved in all restore paths.
Suggestions
- Guard `fused_prefix` removal against empty string (unified_export_megatron.py:1074).
- Minor perf: removed `_amax` guard (model_calib.py:196-201). Functionally correct, but it calls into the sync method for quantizers without `_amax`.
Overall Assessment
Risk: Medium-High. The critical TypeError bug (#1) will crash any user path that invokes mse_calibrate with fp8_scale_sweep=True on a model with NVFP4 static quantizers — which is the exact use case of the new super-nvfp4.yaml recipe. The hf_quant_config.json schema change (#2) is a compatibility concern for consumers of MCore-exported checkpoints.
The core logic changes (dynamic block quantizer detection fix, _global_amax fallback for restored static quantizers, per-layer quant config collection for mixed-precision export, QKV exclude expansion) are well-reasoned and the new unit tests cover the key behaviors.
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
What does this PR do?
Type of change: New recipe + Bug Fixes
`hf_quant_config.json` export bug fixes
MCore and MSE fixes
- Use the `_global_amax` field in `NVFP4QTensor` static quantizer detection; fixes a bug during MCore export for MSE.
- Restore static quantizers as `NVFP4QTensor` (not `TensorQuantizer`, which can call max calibrate; we want to skip max calibrate for static quantizers during restore).
- Handle the case where `block_sizes` is dict-backed.
Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:
rest: not quantized
Usage
# Add a code snippet demonstrating how to use this

Testing
TODO test in HF and MCore PTQ on Nemotron model
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A
Summary by CodeRabbit
New Features
Improvements
Tests